mean_absolute_percentage_error (MAPE)#

Mean Absolute Percentage Error (MAPE) measures the average relative size of the errors:

“On average, how far off am I, relative to the true value?”

Because it is unitless, it’s common in forecasting and business settings.

Important caveat: MAPE is undefined when a true target is 0 (and unstable when targets are near 0).


Learning goals#

By the end you should be able to:

  • define MAPE precisely (and relate it to MAE)

  • build intuition for “relative error” with plots

  • implement MAPE from scratch in NumPy (weights + multi-output + safe handling of zeros)

  • optimize a simple linear regression model by minimizing MAPE with subgradient descent

  • understand pros/cons and when to use MAPE

Quick import#

from sklearn.metrics import mean_absolute_percentage_error

Prerequisites#

  • absolute error / residuals

  • basic linear regression notation

  • gradients / subgradients (helpful but not required)

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots

from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_absolute_percentage_error, mean_squared_error
from sklearn.model_selection import cross_val_score, train_test_split

pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")

rng = np.random.default_rng(7)

1) Definition#

Let \(y \in \mathbb{R}^n\) be targets and \(\hat{y} \in \mathbb{R}^n\) be predictions. Define the per-sample absolute percentage error (APE):

\[ \mathrm{APE}_i = \left|\frac{y_i - \hat{y}_i}{y_i}\right| \]

Then the mean absolute percentage error is:

\[ \mathrm{MAPE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{APE}_i \]

In practice we usually guard against division by zero (keeping the denominator positive via \(|y_i|\)) by using:

\[ \mathrm{MAPE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} \frac{|y_i - \hat{y}_i|}{\max(\varepsilon, |y_i|)} \]

This is the definition used by scikit-learn. Note:

  • scikit-learn returns a relative value in \([0, \infty)\) (e.g. 0.23), not a percentage; note it is not capped at 1 (i.e. 100%)

  • to convert to “percent”, multiply by 100

# A tiny example
y_true = np.array([10.0, 100.0, 50.0, 25.0])
y_pred = np.array([12.0, 102.0, 55.0, 20.0])

ape = np.abs(y_true - y_pred) / np.abs(y_true)
mape = float(ape.mean())  # relative value (0.0 = perfect)

mape, 100 * mape, mean_absolute_percentage_error(y_true, y_pred)
(0.13, 13.0, 0.13)

2) Intuition: “2 units” means different things#

MAPE normalizes each absolute error by the true value.

  • an absolute error of 2 when the true value is 10 is a 20% error

  • the same absolute error of 2 when the true value is 100 is a 2% error

So MAPE tends to care a lot about getting small targets right.

y_true = np.array([10.0, 100.0])
y_pred = np.array([12.0, 102.0])  # same absolute error (2) in both cases

abs_err = np.abs(y_true - y_pred)
ape_pct = 100 * abs_err / y_true

fig = make_subplots(
    rows=1,
    cols=2,
    subplot_titles=("Absolute error (units)", "Absolute percentage error (%)"),
)
fig.add_trace(go.Bar(x=["y=10", "y=100"], y=abs_err, name="abs error"), row=1, col=1)
fig.add_trace(go.Bar(x=["y=10", "y=100"], y=ape_pct, name="% error", marker_color="#E45756"), row=1, col=2)

fig.update_yaxes(title_text="|y - ŷ|", row=1, col=1)
fig.update_yaxes(title_text="100×|y - ŷ| / y", row=1, col=2)
fig.update_layout(height=350, title="Same absolute error, very different relative error")
fig.show()

3) MAPE is a weighted MAE (and it’s scale-free)#

Rewrite the safe definition as:

\[ \mathrm{MAPE}(y, \hat{y}) = \frac{1}{n}\sum_{i=1}^{n} w_i\,|y_i - \hat{y}_i|\qquad\text{where}\qquad w_i = \frac{1}{\max(\varepsilon, |y_i|)} \]

So MAPE is just MAE with per-sample weights that are larger when \(|y_i|\) is small.
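This identity is easy to check numerically. A quick sketch on illustrative synthetic positive targets (the variable names here are local to this check):

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Illustrative data: strictly positive targets plus additive noise
rng_check = np.random.default_rng(0)
y_t = rng_check.uniform(1.0, 50.0, size=100)
y_p = y_t + rng_check.normal(0.0, 2.0, size=100)

eps = np.finfo(np.float64).eps
w_rel = 1.0 / np.maximum(np.abs(y_t), eps)  # per-sample weights from the identity above

mape_direct = mean_absolute_percentage_error(y_t, y_p)
mape_as_weighted_mae = float(np.mean(w_rel * np.abs(y_t - y_p)))

print(np.isclose(mape_direct, mape_as_weighted_mae))  # True
```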

A key property is scale invariance:

\[ \mathrm{MAPE}(cy, c\hat{y}) = \mathrm{MAPE}(y, \hat{y})\qquad\text{for any }c>0 \]

(While MAE scales linearly with \(c\).)

y_true = rng.lognormal(mean=2.0, sigma=0.8, size=200)
y_pred = y_true * (1.0 + rng.normal(0.0, 0.15, size=y_true.size))

scales = np.array([0.1, 1.0, 10.0, 100.0])
mape_vals = []
mae_vals = []

for c in scales:
    mape_vals.append(mean_absolute_percentage_error(c * y_true, c * y_pred))
    mae_vals.append(mean_absolute_error(c * y_true, c * y_pred))

fig = go.Figure()
fig.add_trace(go.Scatter(x=scales, y=mape_vals, mode="lines+markers", name="MAPE (relative)"))
fig.add_trace(go.Scatter(x=scales, y=mae_vals, mode="lines+markers", name="MAE (units)", yaxis="y2"))

fig.update_xaxes(title_text="scale factor c", type="log")
fig.update_yaxes(title_text="MAPE", rangemode="tozero")
fig.update_layout(
    title="Scale invariance: MAPE stays the same when you scale the target",
    height=380,
    yaxis2=dict(title="MAE", overlaying="y", side="right", showgrid=False),
)
fig.show()

4) Loss shape: why small targets dominate#

For a single sample with \(|y|>0\), the MAPE loss as a function of prediction \(\hat{y}\) is:

\[ \ell(\hat{y}) = \frac{|y - \hat{y}|}{|y|} \]

This is a V-shape (like MAE), but its slope is \(1/|y|\).

  • if \(|y|\) is small, the V is steep → small absolute errors cause large percentage errors

  • if \(|y|\) is large, the V is shallow → the same absolute error counts less

y_values = [1.0, 10.0, 100.0]
yhat = np.linspace(-20, 140, 600)

fig = go.Figure()
for y in y_values:
    loss = np.abs(y - yhat) / np.abs(y)
    fig.add_trace(go.Scatter(x=yhat, y=loss, mode="lines", name=f"y={y:g}"))

fig.update_layout(
    title="Per-sample MAPE loss vs prediction (different true values)",
    xaxis_title="prediction ŷ",
    yaxis_title="|y - ŷ| / |y|",
    height=380,
)
fig.show()

5) Common pitfalls (and alternatives)#

Pitfalls#

  • Zero targets: if \(y_i = 0\) then \(|y_i - \hat{y}_i|/|y_i|\) is undefined. scikit-learn substitutes \(\max(\varepsilon, |y_i|)\), which avoids the division by zero but turns each such term into an enormous (finite) value.

  • Near-zero targets: values close to 0 can dominate the average (because they get very large weights).

  • Negative targets: you can still compute a “relative error” with \(|y_i|\) in the denominator, but the word “percent” can be misleading.

Alternatives#

  • WAPE (weighted absolute percentage error): \(\mathrm{WAPE} = \frac{\sum_i |y_i - \hat{y}_i|}{\sum_i |y_i|}\) (behaves better with zeros unless all targets are zero).

  • sMAPE (symmetric MAPE): \(\mathrm{sMAPE} = \frac{1}{n}\sum_i \frac{2|y_i - \hat{y}_i|}{|y_i| + |\hat{y}_i|}\) (bounded, but still has edge cases and interpretability quirks).

  • MAE/RMSE when absolute error matters.

  • RMSLE when relative differences matter and \(y \ge 0\).
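For concreteness, here are minimal sketches of WAPE and sMAPE following the formulas above. The function names and the zero-denominator convention for sMAPE (terms with \(|y_i| + |\hat{y}_i| = 0\) contribute 0) are our own choices, not scikit-learn APIs:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

def wape(y_true, y_pred):
    """WAPE = sum|err| / sum|y| (undefined if all targets are zero)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.abs(y_true - y_pred).sum() / np.abs(y_true).sum())

def smape(y_true, y_pred):
    """sMAPE with the convention that 0/0 terms contribute 0."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    denom = np.abs(y_true) + np.abs(y_pred)
    safe = np.where(denom == 0.0, 1.0, denom)  # avoid dividing by zero
    terms = np.where(denom == 0.0, 0.0, 2.0 * np.abs(y_true - y_pred) / safe)
    return float(terms.mean())

# Zeros in the target: MAPE explodes, WAPE and sMAPE stay finite and moderate
y_true = np.array([0.0, 0.0, 10.0, 100.0])
y_pred = np.array([1.0, 2.0, 11.0, 98.0])
print("MAPE :", mean_absolute_percentage_error(y_true, y_pred))  # astronomically large
print("WAPE :", wape(y_true, y_pred))   # ~0.055
print("sMAPE:", smape(y_true, y_pred))  # ~1.03
```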

# How near-zero targets blow up the percentage error
eps = np.finfo(np.float64).eps

y = np.logspace(-6, 1, 500)  # from 1e-6 to 10
abs_err = 1.0

ape_no_guard = abs_err / y
ape_with_eps = abs_err / np.maximum(y, eps)

fig = go.Figure()
fig.add_trace(go.Scatter(x=y, y=ape_no_guard, mode="lines", name="1/|y|"))
fig.add_trace(go.Scatter(x=y, y=ape_with_eps, mode="lines", name="1/max(|y|, eps)", line=dict(dash="dash")))

fig.update_xaxes(title_text="|y|", type="log")
fig.update_yaxes(title_text="absolute percentage error", type="log")
fig.update_layout(title="Fixed absolute error (1.0) becomes huge when |y| is tiny", height=380)
fig.show()

6) NumPy implementation from scratch#

Below is a NumPy implementation that mirrors scikit-learn’s behavior:

  • supports 1D and multioutput targets

  • supports sample_weight (weights per sample)

  • supports multioutput aggregation ("raw_values", "uniform_average", or per-output weights)

  • uses \(\varepsilon = \mathrm{eps}(\text{float64})\) to avoid division by zero

def mean_absolute_percentage_error_np(
    y_true,
    y_pred,
    *,
    sample_weight=None,
    multioutput="uniform_average",
    epsilon=None,
):
    """NumPy implementation of scikit-learn's MAPE.

    Returns a relative value (e.g. 0.23), not a percent (23%).
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)

    if y_true.shape != y_pred.shape:
        raise ValueError(f"shape mismatch: y_true{y_true.shape} vs y_pred{y_pred.shape}")

    if epsilon is None:
        epsilon = np.finfo(np.float64).eps

    denom = np.maximum(np.abs(y_true), epsilon)
    mape = np.abs(y_pred - y_true) / denom

    # 1D
    if y_true.ndim == 1:
        if sample_weight is None:
            return float(mape.mean())

        w = np.asarray(sample_weight)
        if w.shape != (y_true.shape[0],):
            raise ValueError(f"sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")
        return float(np.sum(w * mape) / np.sum(w))

    # 2D
    if y_true.ndim != 2:
        raise ValueError("y_true must be 1D or 2D")

    if sample_weight is None:
        output_errors = mape.mean(axis=0)
    else:
        w = np.asarray(sample_weight)
        if w.shape != (y_true.shape[0],):
            raise ValueError(f"sample_weight must have shape {(y_true.shape[0],)}, got {w.shape}")
        if np.any(w < 0):
            raise ValueError("sample_weight must be non-negative")
        output_errors = np.average(mape, axis=0, weights=w)

    if isinstance(multioutput, str):
        if multioutput == "raw_values":
            return output_errors
        if multioutput == "uniform_average":
            return float(np.mean(output_errors))
        raise ValueError("multioutput must be 'raw_values', 'uniform_average', or array-like")

    w_out = np.asarray(multioutput, dtype=float)
    if w_out.shape != (y_true.shape[1],):
        raise ValueError(f"multioutput weights must have shape {(y_true.shape[1],)}, got {w_out.shape}")
    if np.any(w_out < 0):
        raise ValueError("multioutput weights must be non-negative")

    return float(np.sum(w_out * output_errors) / np.sum(w_out))
# Quick checks vs scikit-learn

# 1D
y_true = rng.normal(size=60)
y_pred = y_true + rng.normal(0, 0.4, size=60)
y_true[[3, 15]] = 0.0  # include zeros to show eps-guard behavior

print("1D")
print("numpy :", mean_absolute_percentage_error_np(y_true, y_pred))
print("sklearn:", mean_absolute_percentage_error(y_true, y_pred))

# Multioutput
Y_true = rng.normal(size=(80, 3))
Y_pred = Y_true + rng.normal(0, 0.5, size=(80, 3))
w = rng.uniform(0.5, 2.0, size=80)

print("\nMultioutput (raw)")
print("numpy :", mean_absolute_percentage_error_np(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))
print("sklearn:", mean_absolute_percentage_error(Y_true, Y_pred, sample_weight=w, multioutput="raw_values"))

print("\nMultioutput (weighted)")
weights = np.array([0.2, 0.3, 0.5])
print("numpy :", mean_absolute_percentage_error_np(Y_true, Y_pred, sample_weight=w, multioutput=weights))
print("sklearn:", mean_absolute_percentage_error(Y_true, Y_pred, sample_weight=w, multioutput=weights))
1D
numpy : 189188988502492.4
sklearn: 189188988502492.4

Multioutput (raw)
numpy : [1.65270765 5.3344239  1.11974124]
sklearn: [1.65270765 5.3344239  1.11974124]

Multioutput (weighted)
numpy : 2.490739317496624
sklearn: 2.490739317496624

7) Using MAPE to fit a model (from scratch)#

MAPE is often used as an evaluation metric, but you can also use it as a training objective.

For a linear model:

\[ \hat{y} = Xw + b \]

the (safe) MAPE objective is:

\[ J(w, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{|(Xw + b)_i - y_i|}{\max(\varepsilon, |y_i|)} \]

This is convex, but not differentiable everywhere (because of \(|\cdot|\)). A simple low-level optimizer is subgradient descent.

Let \(r = Xw + b - y\) and \(d_i = \max(\varepsilon, |y_i|)\). One valid subgradient is:

\[ \nabla_w J = \frac{1}{n} X^\top \left(\frac{\mathrm{sign}(r)}{d}\right)\qquad\qquad \frac{\partial J}{\partial b} = \frac{1}{n}\sum_{i=1}^n \frac{\mathrm{sign}(r_i)}{d_i} \]

where division is elementwise.

Interpretation: minimizing MAPE is equivalent to minimizing a weighted MAE with weights \(1/d_i\).
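Before trusting the update rule, it's worth sanity-checking the subgradient formulas against central finite differences at a point where no residual is exactly zero (there \(J\) is differentiable, so the two should agree). A sketch on illustrative synthetic data, with names local to this check:

```python
import numpy as np

# Illustrative problem: random design matrix, strictly positive targets
rng_fd = np.random.default_rng(1)
n_chk, p_chk = 50, 2
X_chk = rng_fd.normal(size=(n_chk, p_chk))
y_chk = rng_fd.uniform(1.0, 5.0, size=n_chk)
w_chk = rng_fd.normal(size=p_chk)
b_chk = 0.3
d_chk = np.maximum(np.abs(y_chk), np.finfo(np.float64).eps)

def J_chk(w, b):
    """Safe MAPE objective J(w, b) from the formula above."""
    return np.mean(np.abs(X_chk @ w + b - y_chk) / d_chk)

# Analytic (sub)gradient
r_chk = X_chk @ w_chk + b_chk - y_chk
grad_w = X_chk.T @ (np.sign(r_chk) / d_chk) / n_chk
grad_b = np.mean(np.sign(r_chk) / d_chk)

# Central finite differences
h = 1e-7
num_grad_w = np.array(
    [(J_chk(w_chk + h * e, b_chk) - J_chk(w_chk - h * e, b_chk)) / (2 * h) for e in np.eye(p_chk)]
)
num_grad_b = (J_chk(w_chk, b_chk + h) - J_chk(w_chk, b_chk - h)) / (2 * h)

print(np.allclose(grad_w, num_grad_w, atol=1e-5), np.isclose(grad_b, num_grad_b, atol=1e-5))
```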

# Synthetic regression data with multiplicative (relative) noise
# This is a setting where a relative-error metric like MAPE often makes sense.
n = 500
x = rng.uniform(0, 200, size=n)
X = x[:, None]

# True linear relationship (targets are strictly positive)
y_clean = 20.0 + 5.0 * x  # range ~ [20, 1020]

# Multiplicative lognormal noise with mean 1
sigma = 0.35
mult = np.exp(rng.normal(loc=-0.5 * sigma**2, scale=sigma, size=n))
y = y_clean * mult

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

X_train.shape, X_test.shape
((375, 1), (125, 1))
# Baseline: ordinary least squares (minimizes MSE)
ols = LinearRegression().fit(X_train, y_train)
y_pred_ols = ols.predict(X_test)

mape_ols = mean_absolute_percentage_error(y_test, y_pred_ols)
mae_ols = mean_absolute_error(y_test, y_pred_ols)
mse_ols = mean_squared_error(y_test, y_pred_ols)

(ols.intercept_, ols.coef_[0]), (mape_ols, mae_ols, mse_ols)
((12.796215955793969, 5.021166597032021),
 (0.3234497534673614, 139.99874478140572, 45111.330009565856))
def fit_linear_regression_mape_subgradient(X, y, *, lr0=500.0, n_iters=4000, epsilon=None):
    """Minimize MAPE for y_hat = X @ w + b using subgradient descent.

    Uses a decaying learning rate: lr_t = lr0 / sqrt(t+1).
    """
    X = np.asarray(X)
    y = np.asarray(y)
    n_samples, n_features = X.shape

    if epsilon is None:
        epsilon = np.finfo(np.float64).eps

    # Good starting point when w=0 for L1-type losses
    w = np.zeros(n_features)
    b = float(np.median(y))

    denom = np.maximum(np.abs(y), epsilon)
    history = np.empty(n_iters)

    for t in range(n_iters):
        y_hat = X @ w + b
        r = y_hat - y
        g = np.sign(r) / denom  # subgradient wrt y_hat

        grad_w = (X.T @ g) / n_samples
        grad_b = g.mean()

        lr = lr0 / np.sqrt(t + 1)
        w -= lr * grad_w
        b -= lr * grad_b

        y_hat2 = X @ w + b
        history[t] = np.mean(np.abs(y_hat2 - y) / denom)

    return w, b, history


# Feature scaling helps subgradient methods
x_mean = X_train.mean(axis=0)
x_std = X_train.std(axis=0)

X_train_s = (X_train - x_mean) / x_std
X_test_s = (X_test - x_mean) / x_std

w_mape, b_mape, hist = fit_linear_regression_mape_subgradient(X_train_s, y_train)

# Convert parameters back to the original x scale:
# y_hat = w_s * ((x - mu)/sigma) + b_s = (w_s/sigma) * x + (b_s - w_s*mu/sigma)
slope_mape = w_mape[0] / x_std[0]
intercept_mape = b_mape - slope_mape * x_mean[0]

y_pred_mape = intercept_mape + slope_mape * X_test[:, 0]

mape_mape = mean_absolute_percentage_error(y_test, y_pred_mape)
mae_mape = mean_absolute_error(y_test, y_pred_mape)
mse_mape = mean_squared_error(y_test, y_pred_mape)

(intercept_mape, slope_mape), (mape_mape, mae_mape, mse_mape)
((23.369710958357132, 3.5535421606512725),
 (0.28248252581867633, 162.72868717407727, 75607.15743363052))
fig = px.line(
    y=100 * hist,
    title="Subgradient descent: MAPE objective vs iteration",
    labels={"index": "iteration", "y": "train MAPE (%)"},
)
fig.show()
# Fit visualization: data + fitted lines
x_line = np.linspace(X.min(), X.max(), 250)

y_line_true = 20.0 + 5.0 * x_line
y_line_ols = ols.intercept_ + ols.coef_[0] * x_line
y_line_mape = intercept_mape + slope_mape * x_line

fig = go.Figure()
fig.add_trace(go.Scatter(x=X_test[:, 0], y=y_test, mode="markers", name="test data", marker=dict(size=6, opacity=0.55)))
fig.add_trace(go.Scatter(x=x_line, y=y_line_true, mode="lines", name="true line", line=dict(color="green", dash="dash")))
fig.add_trace(go.Scatter(x=x_line, y=y_line_ols, mode="lines", name=f"OLS (MSE) | test MAPE={100*mape_ols:.1f}%"))
fig.add_trace(go.Scatter(x=x_line, y=y_line_mape, mode="lines", name=f"MAPE-trained | test MAPE={100*mape_mape:.1f}%", line=dict(color="#E45756")))

fig.update_layout(title="Optimizing MAPE shifts the fit toward relative accuracy", height=420)
fig.update_xaxes(title_text="x")
fig.update_yaxes(title_text="y")
fig.show()
# Distribution of absolute percentage errors on the test set
eps = np.finfo(np.float64).eps

ape_ols = np.abs(y_test - y_pred_ols) / np.maximum(np.abs(y_test), eps)
ape_mape = np.abs(y_test - y_pred_mape) / np.maximum(np.abs(y_test), eps)

fig = go.Figure()
fig.add_trace(go.Box(y=100 * ape_ols, name="OLS (MSE)", boxpoints="outliers"))
fig.add_trace(go.Box(y=100 * ape_mape, name="MAPE-trained", boxpoints="outliers"))

fig.update_yaxes(title_text="absolute % error", type="log")
fig.update_layout(title="Test absolute percentage errors (%)", height=380)
fig.show()

8) Practical usage (scikit-learn)#

MAPE is commonly used for evaluation and model selection.

from sklearn.metrics import mean_absolute_percentage_error
mean_absolute_percentage_error(y_true, y_pred)

For cross-validation, scikit-learn follows a “bigger is better” convention and exposes MAPE as negative MAPE:

cross_val_score(model, X, y, scoring="neg_mean_absolute_percentage_error")
scores = cross_val_score(
    LinearRegression(),
    X,
    y,
    scoring="neg_mean_absolute_percentage_error",
    cv=5,
)

# Convert back to positive MAPE and percent
mape_cv = -scores
float(mape_cv.mean()), float((100 * mape_cv).mean())
(0.3169360284034368, 31.693602840343686)

Pros, cons, and when to use MAPE#

Pros#

  • Interpretable: “average relative error” is easy to communicate

  • Unitless / scale-free: comparable across targets measured in different units or at different scales

  • Natural for multiplicative noise: when you care about proportional errors (10% off is 10% off)

Cons#

  • Undefined at 0 and unstable near 0 (can explode and dominate the mean)

  • Over-weights small targets: optimization behaves like weighted MAE with weights \(1/|y|\)

  • Awkward with negatives: the “percent” interpretation breaks down if \(y\) can be negative

Good use cases#

  • forecasting demand/sales/traffic where values are strictly positive and not too close to 0

  • comparing performance across multiple series with different scales

  • business metrics where relative deviation matters more than absolute units

Avoid when#

  • targets can be 0 or near 0 (consider MAE/RMSE, WAPE, or sMAPE)

  • you care about absolute units (e.g. “within ±5 kWh”)

  • targets can be negative and “percent” becomes ambiguous

Exercises#

  1. Implement WAPE and compare it to MAPE on a dataset with many zeros.

  2. Show that minimizing MAPE is equivalent to minimizing MAE with per-sample weights \(w_i = 1/\max(\varepsilon, |y_i|)\).

  3. For a constant predictor \(\hat{y}_i \equiv c\), experiment numerically to find which \(c\) minimizes MAPE (hint: weighted median).

References#

  • scikit-learn API: https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_absolute_percentage_error.html

  • Hyndman & Athanasopoulos, Forecasting: Principles and Practice (accuracy measures): https://otexts.com/fpp3/accuracy.html